Mon. Not. R. Astron. Soc. 000, 000-000 (0000) Printed 5 February 2008 (MN I^TeX style file v2.2) 



Constraining black hole masses from stellar kinematics by 
summing over all possible distribution functions 
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ABSTRACT 

When faced with the task of constraining a galaxy's potential given limited stellar 
kinematical information, what is the best way of treating the galaxy's unknown dis- 
tribution function (DF)? Using the example of estimating black hole (BH) masses, I 
argue that the correct approach is to consider all possible DFs for each trial potential, 
marginalizing the DF using an infinitely divisible prior. Alternative approaches, such 
as the widely used maximum penalized likelihood method, neglect the huge degen- 
eracies inherent in the problem and simply identify a single, special DF for each trial 
potential. 

Using simulated observations of toy galaxies with realistic amounts of noise, I 
find that this marginalization procedure yields significantly tighter constraints on BH 
masses than the conventional maximum-likelihood method, although it does pose a 
computational challenge which might be solved with the development of a suitable 
algorithm for massively parallel machines. I show that in practice the conventional 
maximum-likelihood method yields reliable BH masses with well-defined minima in 
their distributions, contrary to claims made by Valluri, Merritt & Emsellcm. 
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1 INTRODUCTION 

A fundamental application of stellar dynamics is using ob- 
servations of a galaxy's kinematics to constrain its poten- 
tial ^{x). The galaxy is normally assumed to be coUisionless 
and in a steady state, so that the dynamics of any popula- 
tion of stars are completely described by its phase space 
distribution function (DF), f{x, v), which is the probability 
density of finding a star in the small volume of phase space 
around {x,v). By Jeans' theorem, this DF can depend on 
(a;, v) only through the integrals of motion of the (unknown) 
potential. 

Most approaches to this task begin by considering 
the simpler problem of constraining / given the observed 
data and some trial The infinite-dimensional DF is 
parametrized by a finite sum of delta functions (e.g., 
ISchwarzsc h'iid''l97g|) or a truncated basis fu nction expan- 
sion re.g.. .DeiongharmS^: ISag-lia et aUEnwih. and the DF 
parameters are adjusted to optimize the fit to the observa- 
tions. If no set of parameters yields an acceptable fit while 
simultaneously representing a DF that is everywhere non 
negative, then the assumed potential can be ruled out. 

This basic id e a ca n be refined further. 
iRlchstone fc Tremainel l)l988h pointed out that naive 
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application of this method will yield unrealistically spiky 
DFs, leading them to advocate the use of entropy (or some- 
thing similar) as a regularizer. This idea of regularizing the 
resulting DFs was made more explicit by Merritt (1993j), 
who cast the problem as one of finding the maximum 
penalized log-likelihood, C' = — ^X^ + -^^[/]) which allows 
a trade off between goodness of fit, as measured by , 
and smoothness, as measured by the penalty function P. 
The choice of penalty function and the value to use for the 
tradeoff parameter A are arbitrary and subjective. Given a 
range of trial potentials, one finds the maximum penalized 
likelihood for each and then uses normal statistical meth- 
ods to make statements about how well constrained the 
potential is. This general approach has become the method 
of choice in stellar-dynamical searches for supermassive 
black holes (hereafter BHs) in galaxy centres, with choices 
of penalty function ranging fro m the mean-square s econd 
derivative of the DF (e.g., Ivan der Ma rcl et al.' "199^ 
fcapp cUari et alJ 2002]) or entropy (e.g., .Gcbhardt et all 
I2OO3I: ISfige et alJ 12005') through to models in which 
no regularization whatsoever has been applied (e.g., 
Ivan der Marel et aljlT99l iHouehton et ahl EoOe). I refer to 
these as "maximum-likelihood" or "maximum-penalized 
likelihood" methods. 

Another way to look at the problem of constraining 
potentials is to treat it as a straightforward mathemati- 
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cal inverse problem. iDeionghe fc Merritd il992h considered 
the case of constraining the potential of a spherical galaxy 
given perfect knowledge of its projected DF in the form of 
its line-of-sight velocity profiles (hereafter VPs) L(_R;i;p). 
By using a set of higher-order Jeans equations they showed 
that, given the potential, the DF f{£, J^) is completely de- 
termined by its projected VPs. Constraining the potential, 
however, proved less amenable to their methods, but they 
found that choosing a potential too far from the true one 
would yield moments (and therefore DFs) that became neg- 
ative, ruling out that potential. 

Of course, one never has perfect knowledge of the full 
projec ted DF. More recently, IValluri. Merritt. fc EmsellemI 
yOOJ, hereafter VME04) investigated the slightly less ideal- 
ized problem of constraining the BH mass in a toy axisym- 
metric galaxy given noiseless measurements of a restricted 
number of (modified) moments of its VPs averaged over a 
number of spatial bins on the sky. They showed that even 
when the potential has just one free parameter - the BH 
mass - there are many different potentials that can fit the 
available kinematics almost perfectly, even when the central 
spatial resolution of the kinematics is much finer than the 
BIf's sphere of influence. 

One thing that all these methods have in common 
is that they consider just one DF for each trial poten- 
tial. This is fine for th e idealized case considered by 
IDeionghe fc Merritd (llOflST) : given perfectly resolved, noise- 
less projected VPs of a spherical galaxy there is a unique DF 
for any assumed potential, even though this DF may not be 
non-negative everywhere. But when the available data have 
finite spatial and velocity resolution, there will in general be 
many perfectly sensible, non-negative DFs that yield equally 
good fits to the data, and even more DFs producing fits that 
are only slightly worse. Intuitively, one might expect that the 
more such DFs a potential admits, the more likely it is. 

The purpose of the present paper is to revisit the 
the problem of constraining BH masses from a thoroughly 
Bayesian perspective, showing how it naturally incorporates 
this intuitive notion of counting up DFs. To illustrate the 
ideas, I use simulated observations of some idealized spheri- 
cal toy galaxies described in Section 2, modelling them under 
the assumptions given in section 3. In section 4 I test how 
well the conventional maximum likelihood method recovers 
BH masses and counter some of the more pessimistic con- 
clusions of VME04. Section 5 presents a Bayesian approach 
to the problem, which overcomes some of the inconsisten- 
cies of the maximum likelihood method. Finally, section 6 
sums up and discusses the implications for BH masses in 
real galaxies. 



2 TOY GALAXIES 
2.1 Intrinsic properties 

My toy galaxie s are spherical with lu minosity density profile 
JPehncn 199 j: iTremaine et al.lll99i) 



jir) = 



(3-a)L 



47r 



(1) 



and constant mass-to- light ratio T for radii r > 0, so that the 
total stellar mass = TL. At r = there is a BH of mass 
M, — 2x 10"'' M*. The galaxies used in this paper all have 



inner density slope a = 1.5, for which the BH dominates the 
kinematics inside a rad us 0.015a. 

By Jeans' theorem jBinnev fc Tremain(l l98?). a spher- 
ical galaxy can be in equilibrium only if its phase-space dis- 
tribution function (DF) depends on {x,v) only through the 
integrals of motion £ and J , the energy and angular mo- 
mentum per unit mass. T he DFs of the toy galaxies have 
the form dCuddefordlliigill 

-2/3, 



f{£,r) = J-^^g{£), 



(2) 



where the parameter (5 controls the degree of anisotropy, 
with (3=1 — cr^/cr^ . I so lve for q(£) given j{r) and M. using 
the method described in lMagorrian fc'Tremaind lll999f) . and 
present results for both isotropic (/3 = 0) and mildly radially 
anisotropic (/3 = 0.3) toy gala^cies. 



2.2 Observables 

The standard "observations" of each toy galaxy consist of its 
luminosity-weighted VPs averaged over abutting shells, with 
5 shells per decade in radius whose centres run from -Rmin = 



lO'^^a to Ru 



4a. These observations both resolve the 



BH's sphere of infiuence and extend to more than twice 
the galaxies' effective radii . I calculate VPs L(_R ;tip) using 
the procedure described in Ivan der Marel et al.l 12000.) and 
parametrize each using a Gauss-H ermite series iCerhardl 
Il993l:lvan der Marel fc Franxll 19931) . 



Lgh(-R; lip) 
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V 



(3) 



This expresses the VP as an underlying Gaussian with nor- 
malization 7, mean V and dispersion a, modified by a sum 
of Hermite polynomials Hi. For any reasonable choice of 
(7, V', a) it is straightforward to show that choosing 



hi 



^7 



exp 



V 



V 



dv 



(4) 



minimizes the mean-square deviation between L(_R;i;p) and 
Lgh{R',Vp). Therefore, the hi are simply modified moments 
of h{R;vp). I choose {j,V,a) to be the parameters of the 
best-fitting Gaussian to L{R;vp), in which case V and the 
odd hi are zero, {ho,h2) = (1,0) (by equs. |3 and 2] above) 
and hi measures the lowest-order departure of the VP from 
Gaussianity. 

Each realization of a toy galaxy then consists of 19 
VPs. I expand each VP about its underlying best-fit Gaus- 
sian (7, 1/, (t), and reduce the VP to four "measurements": 
the mean surface brightness /, and the three lowest-order 
luminosity-weighted modified moments {Iho, Ih2, Ih^). To 
these I add independent, normally distributed errors AI = 
10"^/, (A/io, A/i2, A/i4) = (0.02,0.05,0.05), comparable to 
the formal errors from observations of real galaxies. Notice 
that the parameters of the underlying Gaussian (7, V, a) are 
chosen by me, not measured, and so have no measurement 
uncertainties. 

Some further com ments on the use o f the modified mo- 
ments are in order. iHoughton et alJ ll2006fl have shown 
that Gauss-Hermite expansions are not particularly well 
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Figure 1. Gauss— Hermite coefficients of the line-of-sight VPs of 
the isotropic (soUd curves) and anisotropic (dotted) toy galaxies. 

suited for parametrizing the VPs of real galaxy centres: 
real VPs can be very strongly non-Gaussian, and in prac- 
tice measurements of the coefficients hi are not indepen- 
dent, even for fixed (7,^,0-). I nevertheless use the Gauss- 
Hermite parametrization in the present paper for the fol- 
lowing reasons: truncated fourth-order Gauss-Hermite ex- 
pansions turn out to provide reasonably accurate fits to the 
VPs of the toy galaxies; real observations have finite veloc- 
ity resolution, which is mimicked, at least qualitatively, by 
truncating the infinite Gauss-Hermite expansion (e.g., com- 
pare the eigen-VPs in fig. 6 of Houghton et a, l. (2006) w ith 
the G auss-Hermite basis in fig. 1 of Ivan der Marel fc Franxl 
il993l) ): finally, using Gauss-Hermite expansions permits a 
more direct comparison with VME04's method and results. 



energy £i. To avoid a rash of indices I henceforth write the 
double sum JSJ as a single sum over n = ne x nj points: 

n 

/(£:,j2) = ^/,5(£:-£:05(J'- Jf). (7) 
1=1 

This discretization effectively partitions phase space into 
abutting rectangular cells, with the luminosity contained in 
each cell being given by 

L, = fJ g{£,J^)A£Aj\ (8) 

where g{£,J^) is the density of states for the potential ip 
and Vi the volume occupied by the cell. In section IFITl it will 
provide convenient to use a dimensionless luminosity 

F. = (9) 

where Ls is a characteristic luminosity scale. 

The models' projected observables I{R), Ihi{R) depend 
linearly on the orbit weights ft, so that the of model 
with DP / = (/i, . . . , /n)"^ is the quadratic form 

x^fW = [Q PW ■ ff ■ [Q - PW ■ f] , (10) 

where 

/ I{R^) IhojR,) _nuiRN)Y , . 

^ - \AI{Rr)' AlhoiRi)'' ' ' Alh^RN)) ^ ' 

is a column vector containing the list of observations, nor- 
malized by their uncertainties, and P(^) is a projection ma- 
trix whose n columns contain the contribution each DP com- 
ponent makes to the model's prediction for Q. The calcula- 
tion of P(^) is described in Appendix A. 



3 MODELS 

The general modelling scheme is the same as that used in 
iHoughton et al.| i2006h . The model potentials ^p{r) have two 
free parameters, the BH mass M, and the mass-to-light ra- 
tio T, corresponding to mass densities of the form 

M 

Pir) = ^5{r)+Tj{r), (5) 

where j{r) is given by (Q. Since the kinematics of the 
toy galaxies are averaged over abutting shells and ex- 
tend to many effective radii, it turns out that T is 
very well constrained by v irtue of the virial theorem 
iRichstone fc Tremainelll988l) . So, for the results presented 
in this paper I simply fix T at its correct value. This means 
that model and galaxy potentials differ only in their BH 
masses. 

Having the potential tp, I discretize the DP 

f{£, J') = E E f'^^(^ - ^')^(-^' - 4). (6) 

i=i j=i 

on an ng x nj regular grid in phase space. The points £i are 
chosen through £i — ipiri) with the spaced logarithmically 
between 10~^a and lO'^a. There are nj values of angular 
momentum for each £i, with jfj running linearly between 
and J^{£i), the angular momentum of a circular orbit of 



4 FITTING MODELS TO OBSERVATIONS: 
THE MAXIMUM-LIKELIHOOD METHOD 

Having a set of observations of a toy galaxy, let us 
first test how well a simplified ve rsion of the stan- 
dard maximum-likelihood me t hod (e.g. .[van der Marel et al, 
I1998S; 'Gebha rdt et al.l 120031: IValluri. Merritt. fc EmseUem 
2004; Hought onet al.l2006l) reproduces the correct BH mass 
and its uncertainties. The procedure is as follows: 

(i) choose a trial BH mass AI, and calculate the corre- 
sponding potential tp; 

(ii) calculate the projection matrices P{ip) appearing 
in itlTTl : 

(iii) use a non-neg ative least-squares algorithm 
iLawson fc Hansmil Il974^ to find Xmin(V')i the mini- 
mum value of (lion subject to the constraint that all 

(iv) assign a likelihood exp[— ixmin(^)] to the poten- 
tial tjj. 

One obtains constraints on M, by considering a range of M, 
and comparing their relative likelihoods. To keep the inter- 
pretation of the results as simple as possible, I do not impose 
any regularization on the fi. Section r4.3l below discusses this 
further. 
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Figure 2. distributions returned by the conventional 

maximum-likelihood method (section|4} for noiseless observations 
of an isotropic toy galaxy (solid curve) and for two different re- 
alizations of observations with simulated noise (dashed and dot- 
dashed curves). 



4.1 VME04's flat-bottomed x^i.M.) distributions 

One of the more alarming conclusions reached by VME04 
was that the problem of constraining BH masses is inher- 
ently strongly degenerate; they found that a wide range of 
BH masses could provide equally good fits to mock kine- 
matics with realistic spatial resolution. The main goal of 
this section of the present paper is to understand this re- 
sult and to investigate whether its implications really are as 
negative as VME04 suggest. 

The solid curve in figure |5| plots x^{M,) obtained for 
models with ng x nj — 100 x 10 DF components when ap- 
phed to noise-free observations of the isotropic toy galaxy. It 
demonstrates that VME04's central result also holds for the 
simpler spherical case considered here: a wide range of Af, 
can produce perfect fits to perfect noiseless data. For BH 
masses in the range 1.8 X 10"^ < M./Mgai < 3.0 X 10'^ x' 



is of order 10"^^ rising to ~ lO"'^ for M./Mgai = 1.7x 10" 
or 3.1 X 10^'^. A model with no BH can produce kinemat- 
ics that differ by only a small amount (Ax^ ~ 2.4) from 
the toy galaxy's. Of course, these values of a-re statisti- 
cally meaningless since they do not account for the fact that 
the observations in this contrived situation have zero uncer- 
tainty; the increase of to 10~^ from its minimum value of 
10"^^ (which is zero to machine precision) is actually very 
significant. 

The other two curves in figure |21 show the results of 
adding two different realizations of noise to the simulated 
dataset. This makes x^i^*) become nicely rounded, sim- 
ilar to what one finds in models of real gala xies (e.g., 
Ivan der Marel et aljUooi lOebhardt et alJl2003t l. VME04, 
however, only presented results for the noiseless case. 

These results can be explained by remembering that, 
for a fixed potential, x^ is a quadratic form IjlO^ in the or- 
bit weights /. Since the number of unknowns is very much 
less than the number of observations, this quadratic form 
is hugely degenerate: it resembles more a multi-dimensional 
trough than a parabola. If we relax the constraint that all 



fi ^ 0, then it turns out that for all the models consid- 
ered here - independent of the value of M, - the value of 
X^ at the bottom of the trough is zero (to machine pre- 
cision); the non-negativity constraint is essential for con- 
straining the BH mass. Of course, for the correct model with 
M, = 2 X 10"'^, the bottom of the trough passes through 
the discretized version of the true DF Q, which is well away 
from the boundaries given by fi ^ 0. Then making a small 
change in the trial Al, leads to a small change in the projec- 
tion matrix, particularly for those fi corresponding to the 
most tightly bound orbits, and therefore changes the shape 
of the quadratic form slightly as well as the location of its 
minimum. Changing the potential too much moves the lo- 
cation of the minimum to a region where at least one of the 
weights becomes negative, so that the minimum value of 
in the subvolume /i ^ is no longer zero. 

Adding noise simply shifts the centre of the quadratic 
form, with no change in its shape. For realistic amounts of 
noise, the centre is shifted well into the region where many 
of the orbit weights are negative, leading to the rounded 
X^(M.) profiles. 



4.2 The effects of signal to noise on the 
uncertainties on A/, 

In order to examine this more quantitatively, let us consider 
how the uncertainties on Af, depend on the signal-to-noise 
ratio of the simulated data. To do this, we need a method 
of quantifying the uncertainty on Af. . The accepted prac- 
tice in this field is to assume that the Ax^ = 1 boundaries 
of Xmin(-^») give reliable i ndicators of the 68 percent confi- 
dence limits on M, (e.g., Ivan der Majel_et_al| ^9^. This 
is based on the assumption (e.g!~ IPres^^L ^^92) that 
Xmin(-^«) is close to quadratic and therefore that the prob- 
ability distribution exp(— Xmin/2) is almost Gaussian, but 
the results above and in VME04 show that this assumption 
can be far from the truth. So, throughout this paper I use 
the mean and variance, 



M. = A J M, exp 
{AM.f = a[ (Af. ~ M^f exp 



ixmin(M.) 



dM. 



(12) 



dM., 



to quantify the best-fitting A/, and its associated 
uncertainty, the quantity A being chosen to make 
A J exp[-Xmin] dAf. = 1. I find that this AA4. agrees well 
with the (correctly calculated) 68 percent confidence inter- 
vals for the following. 

Figure O shows how Af. and AAf. vary as one changes 
the size of the observational uncertainties for one particular 
noise realization AQ, the errors (A///, A/io, A/12, A/i4) = 
sAQ being shrunk by a factor s with respect to the "stan- 
dard" observational errors. The following points hold for 
typical noise realizations AQ: 

(i) Given noiseless data (s = 0), a range of Af. spanning 
1.2 X 10~''Afgai can produce perfect fits. This is essentially 
the "Ax^ = 1" measure of the uncertainty on Af. applied 
to a uniform distribution. 

(ii) The variance of a distribution uniform for x G [a, b] is 



given by 



"'/12. Therefore, the more useful estimate of 
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Having examined the effects of observational uncertain- 
ties on the uncertainty on M,, let us now turn to the sim- 
pler question of whether the maximum-likelihood method 
yields estimates of M, close to the true BH mass. Taking 
the s — 1 situation of realistic noise and averaging over 
many hypothetical datasets, the typical formal error in M, 
is about 0.8 x IQ-^Mgai (isotropic galaxy) or 1 x lO^^Mgi^i 
(anisotropic galaxy), both increasing only slightly as the 
number of DF components used increases (Table 1). The 
maximum-likelihood method yields fairly strongly biased es- 
timates of M, for the anisotropic galaxy, which nevertheless 
are well within the formal uncertainties. 



Figure 3. Effect of relative signal-to-noise ratio, on the BH 
mass M, returned by the maximum-likelihood method for a typ- 
ical realization of the noise in the observations. The solid curves 
in the top panel plots the mean M, and its formal uncertainty 
(eg. 1121 as a function of s, the relative size of the observational er- 
rors. For comparison, the dotted curves show the uncertainties on 
Mm returned by the widely used Ax^ = 1 criterion. The bottom 
panel plots the corresponding minimum value of x^- The s = 1 
case corresponds to the dot-dashed curve in figure l2l 



Table 1. BH mass estimates for toy galaxies using the conven- 
tional maximum-likelihood method 
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100 X 10 
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200 X 20 


10-3 


2.55 


0.3 


100 X 10 


10-4 


2.08 



(A(Af./10-3M, 



gal 



4.3 The appropriateness of regularization 

Apart from the strange dependence of the formal uncer- 
tainty AM, on the signal-to-noise ratio, perhaps the most 
telling feature of models obtained using the maximum- 
likelihood method is how well they fit: they are too good 
to be true. While adding realistic amounts of noise to the 
observations removes the flat bottom in x^, the value of 
at the minimum remains very much less than the number 
(76) of observed data points (e.g., figure |2J. These fits are 
implausibly good; the chances are tiny that the actual val- 
ues of (/, ho,h2, hi) in the real galaxy are all so close to the 
observed estimates. Furthermore, the models achieve this 
level of fit by having only ~ 70 of the fi greater than zero: 
internal kinematics of the model are very irregular. It 



0.81 
0.85 
0.34 

0.99 
1.06 
0.42 



the uncertainty given by equ. 11211 is a factor \/T2 smaller, 
or 0.35 X 10-3 Afgai. 

(iii) One can obtain perfect fits to the data for any 
s < 0.1, the precise upper bound on s depending on the 
particular noise realization. 

(iv) Around the value of s where the best-fitting starts 
to lift off from zero, the uncertainty AM, drops as s in- 
creases. 

(v) Overall, AM, grows slowly with s; it grows by only 
a factor ~ 1.5 between the noise-free s = and the more 
realistic s = 1 situation. 

At first sight the last point might seem to suggest that 
there is little point in obtaining very high signal-to-noise 
observations, but of course one could extract higher-order 
information on the VPs from such observations, such as he, 
or its equivalent, and, in some cases, one could also use finer 
spatial binning. Both of these would act to reduce the de- 
generacy in M. for s — 0. 



Columns are: the galaxy's anisotropy (/3), the number of orbits 
used in the modelling (ng X nj) and the innermost extent of 
the projected kinematics (Rmin); the mean BH mass from many 
realizations (Af.) and the typical formal uncertainty in M,. 



is important then to consider how M, is affected when one 
includes mode ls that y i eld mo re plausible fits to the data. 

Following Mer ritti ilQQSl). the approach advocated by 
VME04 and subsequently bv lCretton fc EmsellemI ||2004|) is 
to regularize the DF, finding for each M, the {fi} that 
maximize a penaUzed log-likelihood, -fx^ -I- AP[/]. The 
pena lty function P[f] provides some arbitrary measure of 
the smoothness of the DF and the parameter A is set by 
how much one is willing to trade off goodness-of-fit for a 
smoother DF. Although beguiling, this approach is only 
marginally better than the conventional maximal likelihood 
method used above, because both 

(i) identify a single privileged "best" DF; 

(ii) and then take this DF to be representative of all of 
the DFs for the assumed potential. 

The first step is fine (at least for certain applications), but 
the second is wholly unjustified and ignores the fact that 
X^lf] is hugely degenerate. 

To see a variant of this problem in a much milder con- 
text, consider how one measures M, in real galaxies. The 
potential then has at least one additional free parameter, 
the mass-to-light ratio T, and one has to construct a grid of 
models for a range of different values of M, and T. The un- 
certainties on M, are never obtained by picking out a special 
value of T for each M. ; instead one marginalizes T either 
explicitly (e.g.. iGebhardt et aljl20o3) or implicitly through 
the use of a Ax^ crite rion fe.g.^ Tvan der Marel et aDll99Sl : 
ICappellari et alJl2002l) . This idea of marginalization is key 
to resolving the issues noted above. 
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5 FITTING MODELS TO OBSERVATIONS: A 
BAYESIAN APPROACH 

The maximum-likelihood approach of the previous section 
pays scant attention to the orbit weights fi, which serve 
merely as co-ordinates used to locate a point somewhere 
along the degenerate minimum in x^- From a Bayesian point 
of view, however, the fi are nuisance parameters. Although 
we are not interested knowing their precise values, they de- 
serve to be treated on an equal footing with the parameters 
defining the potential. 

Applying Bayes' theorem twice, the posterior probabil- 
ity of a model with potential and a set of orbit weights /, 



that 



p(i^,/|Z?)(xp(i3|/,V)p(/|^)p(^), 



(13) 



where p{D\ip,f) — exp[— ^x^(/l''/')] is the usual likelihood 
and the priors p(/|i/') p(V') will be discussed later. Since 
we are not interested in the values of weights (as long as they 
are non- negative), let us marginalize 1131 to obtain 



p(V'|D) oc p(DiV)p(^), 

where the marginalized likelihood. 



p(^IV') 



p{D\f,^)p{f\i,)df 



exp 



L 2 
^Xmarg 



(14) 



(15) 



is obtained by summing the likelihood over all non-negative 
DFs, each weighted by the as-yet-unspecified prior p{f\ip)- 
Notice that this is directly analogous to the partition func- 
tion in statistical mechanics, with pla-ying the role of en- 
ergy and the prior standing in for the density of states. The 
conventional maximum-likelihood method of the last section 
can be viewed as the very crude approximation 



p(L»|^) ~ max^p(D|/,7/>) 



exp 



1 

2 TSo 



min [f] 



(16) 



obtained by completely ignoring the prior p{f\il') and ap- 
proximating the remaining integral by the peak value of its 
integrand. There is a straightforward and obvious analogue 
for the maximum penalized likelihood method. 



5.1 The priors 

There is nothing noteworthy about the choice of the poten- 
tial prior p('(/') for the situation considered here. The po- 
tential has one free parameter, M,, which can be zero or 
positive, meaning that the natural prior to use is fiat in 
logM.. 

The choice of prior for the DF, p(/|^/'), is more inter- 
esting. Recall that we use discrete cells (|7|l to model con- 
tiimous phase space. Now, there is no a priori natural way 
to partition phase space into cells. Let us assume for the 
moment that there are no correlations among cells and that 
the prior is independent of location in phase space. Let fit be 
the prior expectation value for the dimensionless luminosity 
Fi = Li/ Lb (eq. O in cell i; it can therefore be thought of 
as measure of the cell's volume. Then a natural requirement 
on the prior is that, given a partition tt of phase space, we 
should be able to select any cell and construct a new parti- 
tion tt' by subdividing this cell into m > subcells and have 



p^{F\ii,^)= dFi-- dF„,5{Fi + ---+Fn-F) 
Jo Jo 

X p^/(Fl|^l,'!/') X • ■ • X p^,{Fm\^J.m,i'), 

(17) 

where the cell volumes satisfy -|- • ■ • -I- Hm = That 
is, marginalizing over the subcells should return the original 
prior. A consequence of this is that, for finite-resolution data, 
the marginalized likelihood p{D\il)) is independent of the 
chosen partition provided only that one uses fine enough 
cells. 

The infinite-divisibility (hereafter ID) condition 1171 
puts strong constraints on the form of the prior. It is not 
satisfied by any of the simplest commonly used priors 



p(i^lV) 




-aF In F] 



(uniform) 
(Jeffreys), 
(entropy). 



(18) 



The form of the convolution in eq. (1171 suggests that Laplace 
transforms might be helpful in finding ID priors, and indeed 
it can be shown llFelle rl IToTll . §XIII.7) that a probability 
distribution p(_F|/i) satisfies 1171 if and only if its Laplace 
transform 



p(s|/i) 



dFe""'^p(Fj^i) 



is of the form 



p(s|/i) = exp 



F 



-M{p,dF) 



(19) 



(20) 



with the only constraint on the measure A4{fi,dF), known 
as the Levy measure, being that the integral 



F~'X(/i,dF) 



(21) 



converges for all e > 0. Thus we can construct priors that 
are guaranteed to be ID simply by considering a variety 
of choices for M. For example, substituting M{fj.,dF) = 
IJ.5{F—1) dF in II2UII results in the Possion distribution, while 
M{n,dF) = fie-'^dF gives the gamma distribution. Both 
of these obviously satisfy the ID criterion 1171 . 

A concise and very readable introduction to this subject 
IS given by (filling (199i). He argues that the maximally 
ignorant choice of A4 when one knows only a characteristic 
scale for F is 



M{fi,dF) = fiFe'^ dF. 
This results in the so-called 



'massive inference" prior 



p{Fi\fj.i,ip) 



(22) 



(23) 



where /i is the first-order modified Bessel function of the 
first kind. Appendix B gives an elementary derivation of 
this prior, explaining how it is the natural generalization 
of entropy to continuous distributions. I adopt it for the 
calculations below. 

There remains the question of what to choose for the 
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Figure 4. The differential energy distribution N{£) (equ. 1251 
for isotropic (solid curve) and anisotropic (dashed) toy galaxies, 
along with their difference (bottom panel). r{£) is the apocentre 
radius of a perfectly radial orbit with energy £. 



prior weights fii. From the Laplace transform I2UII it is 
straightforward to show that the prior mean and variance, 



var(F) = 



M{p,dF), 
FM{n,AF), 



(24) 



and therefore that the prior 12311 likes to have F ^ 
In the absence of a compelling model of galaxy formation , 
I simply use the well-known fact ijBinnev fc Tremainelll98^ 
that for reasonable galaxy models the differential energy dis- 
tribution, 



/(£:,J2 )(?(£:, J2)dj2 



(25) 



is almost independent of anisotropy once the galaxy's lumi- 
nosity density j(r) and potential ^(r) have been specified 
(e.g., see fig.BJ- For each trial potential I find the isotropic 
DF /iso(f ) that produces the luminosity density Q and, 
following (|HJ and Q, assign 



Mi = 



(26) 



This depends on the choice of characteristic luminosity scale 
Ls (equ.|3Jl. Now, the prior RMS fractional spread in each 
cell is equal to ^J^fjii, which clearly depends on Ls and 
has a Poisson-like N^^^'^ dependence on the cell volume.^ 
In order to make the prior variance independent of the par- 
titioning scheme used to represent the DF, I introduce a 
second, independent reference partition and choose 



2 

52 



(27) 



so that the prior RMS fractional spread in each reference 
cell is given by the adjustable parameter 5. For the results 



^ This is inevitable for any ID distribution, which can be seen 
either from 1241 or from the fact that any ID prior is a li mit of a 
sequence of compound Poisson distributions lFelledll97ll) . 



presented here I use the partition defined by the ne x n,j — 
100 X 10 grid for this reference partition. 

5.2 Marginalization 

Although one could attempt the difficult task of evaluating 
the marginalized likelihood p{D\ip) (eg. 1151 directly, we are 
not so much interested in the absolute value of p{D\^) as in 
the odds, 

p(^ii-P) ^ p(giV'i)p(V'i) 

p{i,o\D) p(Z?iV'o)p(Vo)' ^ ' 

of one potential tpi compared to another tpo- I evaluate the 
ratio p{D\^pi)J_2U^j^l^o) using the method of thermodynamic 
integration jNea<ll993l) . Consider two models, one having 
potential tpo, the other with a slightly difi'erent potential t/ii, 
but both having ne x nj DF components chosen accord- 
ing to the scheme described in section 3 for their respective 
potentials. Let 



exp[C7A(/)]d/ 



where 



1 
2 

+ (1 



= -- [(1 - \)xHf\i'o) + Ax'(/|^i)] 
A) lnp(/|Vo)+Alnp(/l7/.i). 



(29) 



(30) 



As the parameter A varies between and 1, Z\ interpolates 
smoothly between p{D\tpo) and p(_D|i/;i). Taking the loga- 
rithm of 12911 and differentiating with respect to A, 



dA 



log Zx 



ACx 
dA 
ACx 
dA 



1 



exp[CA(/)]d/ 



( — ) 



(31) 



A' 



where {C)\ denotes the expectation value of C{f) when / 
has probability density exp[CA(/)]/^A- Therefore we can use 
a Markov-Chain Monte Carlo method to draw points from 
this density, and taking the mean value of ACx/AX for these 
points gives an immediate estimate of Alog Z\/A\. Then, 
integrating. 



dA- 



dA 



log Zx = lo; 



Zj_ 

Zo 

' 2 



= log 



(^l) 



p(g|V^i) 
pPIVo) 



(32) 



2 

Xmarg 



(V'o)) • 



This method will be reasonably efficient only if the depen- 
dence of Ca(/) on / does not vary significantly as A changes, 
which is the case for the choice of /; using the scheme de- 
scribed in Section 3. 

For the results presented below I use a Gibbs sampler to 
draw 5 x lO'^ points from exp[CA(/)] after a burn-in period 
of lO'' iterations starting from F = fi. Instead of evaluating 
the derivative 13H for a range of fixed A I instead increase 
A slowly from to 1 over the course the iterations, yield- 
ing logp(-DIV'i) — logp(-DIV'o) directly. A direct test of this 
procedure is to run it backwards by swapping the potentials 
around and calculating log p(Z)|i/)o) — log p(D|'!/'i). I find that 
that both forward and backward iterations typically agree 
very well, provided S < 20. For larger 5, the delta function in 
the prior 1231 becomes dominant and the posterior becomes 
effectively stuck at _Fi = for a significant fraction of the 
DF components. 
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Figure 5. The importance of infinite divisibiUty. The left panel plots the derivatives of the marginalized log-likelihood Xmarg(-'^») ' 
— 21np(Z)|JVf») calculated using equ. l32l for a non-ID flat prior. Solid, dashed and dotted curves correspond to models with ng X nj - 
100 X 10, 200 X 10 and 200 X 20 DF components, respectively. The panel on the right plots the same for the ID prior 1231 with (5 = 5. 
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Figure 6. The dependence of the formal uncertainty in BH mass 
on the fractional variance used in the prior. 
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Figure 7. Marginalized likelihoods for one realization of the 
anisotropic toy galaxy. The solid curves plot p(D|M.) using pri- 
ors with (5 = 5, 10 and (heavy solid curve) 20. For comparison, 
the dashed curve plots the corresponding probability distribution 
returned by the conventional maximum-likelihood method. 



5.3 Results 

Here I present results of applying the Bayesian analysis to 
observations of the anisotropic toy galaxy. The resuhs for 



the isotropic galaxy are similar but less instructive since I 
use the isotropic case to assign the prior weights fi. 

Figure 13 demonstrates the importance of using a prior 
that satisfies the ID criterion 1171 . Taking a (non-ID) prior 
flat in the orbit weights / leads to a marginalized likeli- 
hood p{D\^l>) that depends on the number of orbits used 
and, more generally, on exactly how one discretizes phase 
space. In contrast, the marginalized likelihood p{D\M,) for 
the ID prior I23II does not depend on whether we discretize 
using nsxnj = 100 x 10, 200 x 10 or 200 x 20 cells. Inciden- 
tally, this also shows that, for the present purposes at least, 
it is acceptable to use delta functions (eq.[7|l to calculate the 
contribution each cell makes to the observations. 

Although the marginalized likelihood is independent of 
the discretization, it still has one free parameter, the frac- 
tional variance per reference cell S^. To investigate the de- 
pendence of the results on 5, I use the mean and variance of 
the posterior distribution p{M,\D,5), 



M.{S) 

{AM.{S)f 



M.p{D\M.,S)p(M.) dM. 

[M. - M.(5)) ^ p{D\M. , S)p(M,) AM.. 



(33) 



Figure El shows how the formal uncertainty AM. varies with 
5 for a typical realization of the observations. As one might 
expect, AM, is low for small 5 ~ 1, but grows rapidly 
as 6 increases. Once 5 reaches ~ 10 though, AA/, stabi- 
lizes at about 0.4 x 10~^ Mgai. The marginalized likelihoods 
p(_D|M,) used in calculating AM, for the three largest val- 
ues of 5 plotted are shown in Figure |2| which shows that 
there is little change as one increases 5 from 10 to 20. It 
is somewhat surprising that 5 needs to be so large. This 
might be a consequence of the power-law dependence of the 
DF (eq. HJ on J. For comparison, the figure also includes 
the probability distribution returned by applying the con- 
ventional maximum-likelihood method to the same dataset. 
Taking many realizations of the galaxy, the Bayesian method 
yields a mean AM, of 0.44 x 10~''Mgai compared to the 
0.99 x 10~'^Mgai returned by the conventional method (ta- 
ble 1). 
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6 CONCLUSIONS 

Most BH mass estimates come from the conventional 
maximum-likelihood method. They fit observations using 
models with specially constructed, unrealistically spiky DFs, 
yielding implausibly good fits to the observations. Results 
from the toy galaxies considered here suggest that the (un- 
penalized) maximum-likelihood method nevertheless does 
yield reliable BH masses, but with overly pessimistic error 
estimates. 

I confirm VME04's result that a wide range of BH 
masses can yield perfect fits to finite-resolution, noiseless 
observations. Contrary to their somewhat speculative argu- 
ments, however, I show that one does not expect to find flat- 
bottomed distributions in practice, unless one is blessed 
with very high signal-to-noise observations (figure|2l and fits 
only the low-order shapes of the VPs. 

Although the max;imum-likelihood method yields reli- 
able BH masses, it is flawed because it considers only one 
DF for each BH mass. The remedy is to consider all possi- 
ble DFs for each potential, weighting each one by a suitably 
chosen prior. This significantly improves the constraints on 
the BH mass, since the closer a trial potential comes to the 
true potential, the greater the number of non-negative DFs 
that are consistent with the observations. 



6.1 Open questions and future work 

6.1.1 Choice of prior 

An open question is to what extent these results depend 
on the choice of prior weights (equ. I26II . and more gener- 
ally on the use of the prior 12311 . which is just one of many 
possible infinitely divisib le distributions. A very general ar- 
gument jKingmanllliol shows that drawing random real- 
izations from most ID priors will yield spiky, uncorrelated 
distributions. This is just what one expects if dealing with 
galaxies at the level of individual stars, but the present mod- 
els are far from this level of detail and one might plausibly 
expect some degree of correlation among neighbouring cells 
in phase space. Making this idea quantitative is difficult, 
however. 



6.1.2 Computational scheme 

It would be straightforward in principle to apply the ideas 
presented here to axisymmetric or triaxial galaxy models. 
One could s imply adopt the prio r (1231 . using the scheme 
described bv lThomas et all (|200^ to calculate the volumes 
used to assign the prior weights jj,. In practice, however, 
calculating the marginalized likelihood p{D\ip) will proba- 
bly be very difficult. The DFs of axisymmetric galaxies are 
three- integral, which means that, in order to ensure a fine 
enough discretization, one has to use many more DF com- 
ponents than for two-integral spherical models, with the 
Markov Chain Monte Carlo procedure used in section 15.21 
taking correspondingly more iterations to converge. On the 
other hand, it is very likely that there are much more effi- 
cient methods than the combination of Gibbs sampling and 
thermodynamic integration used here. 



6.1.3 More immediate problems 

Before applying this method to real galaxies though, it 
is probably worth addressing the following more tractable 
problems first: 

(i) VPs are extracted from spec tra, which suffer from 
poorly understood systematic errors ijHoughton et al. 200f|). 

(ii) For real observations, neither individual VP velocity 
bins nor fsurprisingly) Gauss-H ermite coefficients are inde- 
pendent iHoughton et aljl200(J) . 

(iii) Most models of axisymmetric galaxies make an ad 
hoc assumption about the galaxy's three-dimensional light 
distribution j{R,z), despite the fact that there are many 
j{R,z) co nsistent with a given surfac e brightness distribu- 
tion (e.g., iKochanek fc RvbickillQQ^) . Neglect of this de- 
generacy can lead to incorrect inferences about the galaxy's 
orbit structure ( Magorrian 1999) which are likely to affect 
BH mass estimates. 

(iv) The widely used Ax^ criteria for obtaining uncer- 
tainties on BH masses are based on the assumption that 
X^(Af, ,T) is close to quadratic, which does not necessarily 
hold in practice. 
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A more realistic psf is a Gaussian with some dispersion ai,, 
for which 



piR,R') 



1 








exp 

at 


[ 2(72 \ 


lo 


fRR'\^ 



(A5) 



where lo is a Bessel function. Both these examples are sym- 
metric, but we note that we can use (Eg and ||X3|| to con- 
volve any spherically symmetric function f{R) with a psf of 
arbitrary shape. 

Now consider a single DF component ^ of energy £ 
and angular momentum J per unit mass in potential 4}{r). 
Written explicitly as a function of {x,v), 

f[x,v) = S[i,^^{vUvl+vl)-£]S[r^{vj+vl)~J^]. (A6) 

The individual orbits making up the DF component have 
peri- and apo-centre radii r± given by the roots of the equa- 
tion Vr{r) = 0, where 



(A7) 



and I have omitted the obvious dependence of the result on 
£, J and ip. The velocity moments of the component l)A6^ . 



r 2i 2j 2fei/ N _ / ,3 2i 2 



2i 2j 2k r 



,2i-l j2(j + fc) 
,,2(j + fc+l) 







if r_ < r < r+, 
otherwise. 



(A8) 



APPENDIX A: OBSERVABLES OF DF 
COMPONENTS 

I use a rectangular (x, y, z) co-ordinate system with origin O 
at the galaxy centre and whose Oz axis is parallel to lines 
of sight. Points on the plane of the sky are then labelled by 
the co-ordinates {x,y). Of course, real observations do not 
have perfect spatial resolution. Instead, any function f{x, y) 
defined on the plane of the sky is measured convolved with 
a two-dimensional point-spread function psf(Aa;, Ay): 

/meas(a;,?/) = y J psf{x ~ x' , y ~ y') f (x , y') dx' dy' . (Al) 

Since we assume that the galaxy is spherical, then f{x, y) = 
f{R), where R = ^/x^ + y^ is the usual cylindrical polar 
radius, and the psf-convolved value of / at radius R, 



/mcas(i?)= I p{R,R')fiR')R'dR', 
where 

p{R,R')= / psf (i? — i?' cos (^', J?' sin 



/ T-y/ . , /\ 1 , / 



(A2) 



(A3) 



is the azimuthally integrated contribution of light from 
radii R' to measurements at radius 7?. For example, in sec- 
tion |21 the toy galaxies are "observed" through annulii that 
admit light between some radii Tii and R2- For this situation 



Taking i — j = k — yields the luminosity density j{r) = 
2-K/r'^Vr, which has integrable singularities at both r = r_ 
and r — r+. Integrating j{r) over radius, the total luminosity 
of the component 1IA6I1 is given by 



(A9) 



which is just the usual density-of-states factor. 

Substituting j{r) into equation 1A2II . the psf-convolved 
surface brightness distribution 

m-^^l '^^'pi^^^')R'J_ ^n'^^z'^mR',.') 

''+ dr 

47r / -^::tt- I piR, r sin 9) sin ede. 



r^Vr 



(AlO) 



I evaluate this integral numerically by substituting r — 
r_ + (r+ — r_) sin^ it and applying Simpson's rule with in- 
tervals Au — 7r/200 and A9 — ir/bO. A simple test of this 
calculation is to compare the integrated surface brightness 
2-K J RI{R)dR against the density-of-states factor IIA9l l. I 
find that the two typically agree to around one part in IC*. 

The calculation of velocity profiles is a little more in- 
volved. A star at position {R, z) with velocity {vr, vo,v^) has 
projected line-of-sight velocity 



P{R,R') 



1 R2 p21 
K2 — -tti 



iiRi<R< i?2, 
otherwise. 



(A4) 



Wp = - [zvr + Rve] . 



(All) 



The luminosity density of stars located at a position (R, z) 
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having projected velocities in the range vi < Vp < V2 is 



Equation IBll becomes 



/■f2 r 
j{R, z;vi,V2)= / dvp / f(x,v)S 

J V\ J 

= E 



{zVr + Rve) — Vp 



Avp 



Rr\VrV^±y 



(A12) 



where the range of integration includes only those Vp £ 
[ui, U2] for which 



V^±{R,z;Vp) 



J' 



[rvp ±zVrf 
R-2 



(A13) 



is non-negative, and the ^_|_ takes care of both possibilities 
for the sign of Vr- Integrating along the line of sight and 
using (IA2II . the (unnormalized) psf-convolved contribution 
of the DF component l|A6|l to a VP bin that extends from 



from Vp — vi to 112 is 



L{R;vi,V2) = 2 



rdrx 



p{R, r sin 9)j{R, r cos 6; vi , V2)d9. 



(A14) 



The integral 1IXT2I for j{R, r cos 9; wi , U2) can be carried 
out by hand, but finding efficiently the regions where it is 
nonzero is far from easy. So, to evaluate the double inte- 
gral (I A 1411 I simply adopt the scheme I use to calculate I{R) 
and substitute r — r- + (r+ — r_) sin^ u: the integration over 
Vp already takes care of the worst effects of the singularities 
in the integrand of <A12l l. To test this approach, I com- 
pare the integrated VP histogram, iAv, {i + 1) An) 
for some choice of Av, against I{R)- For my standard in- 
tegration parameters, the typical RMS fractional difference 
between these two quantities is about 10~^. This merely 
indicates that very little light "leaks" from the models, not 
that its VPs are calculated to that accuracy. Comparing VPs 
calculated with different stepsizes, I estimate that this stan- 
dard integration scheme yields VPs with an RMS fractional 
error of a few parts in 10^. 

To obtain the contribution each DF component makes 
to the modifed moments Ihi{R), I calculate the component's 
VP histogram L(_R; Vi, Ui+i) for 20 bins from uo = to «24 = 
5(T, where a is the velocity dispersion used in the Gauss- 
Hermite expansion, and use this histogram in equation Q. 
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The f ollowing derivation of the prior lEJ follows ISkiliiiiS 
il998l) . Imagine a monkey iGull fc Daniell 1978.') with a bag 
of N identical stars each of luminosity L*. He sits outside 
a big box containing phase space (strictly, integral space), 
takes each star in turn and throws it into the box. The prob- 
ability of each star landing in a small volume d?xA'^v around 
the point {x,v) is (x{x , v)d? xA'^ v . If we take a small cell of 
volume 5V , the probability that r of the stars land inside 
the cell is 



(Bl) 



Now shrink the cell volume 5V and increase the num- 
ber of stars N ^ 00 keeping the product NSV constant. 



p(r|a) = — e 



(B2) 



describing a Poisson process with mean fj, = NaSV. There- 
fore, the probability of having luminosity Li in a cell of 
phase-space volume Vi is given by 

00 J, 

p(L.|pO=e-«^^5(L.-rL,), (B3) 

with jii = N Jy ad'^xd'^v ~ NaiVi. For a partition of 
phase space into n cells, the probability of the configuration 
(Li, . . . ,Ln) is given by a product of factors like IIB3II .which 
obviously satisfies the criterion 1171 for infinite divisibility. 
Apart from uninter esting scale factors, this prior is identical 
to equation (20) of ISh J 111973) . In the limit /Xi > 1 it takes 
on the entropy-like form e"*^ exp[— (L/L*)(l -I- ln(L/aI/t))]. 

The prior IBSII has the disadvantage of requiring a dis- 
cretization of the light into individual stars, with the result 
that L is very strongly peaked at whenever ^ ^ 1. For an 
alternative, suppose that we place the monkey inside the box 
representing phase space and liquify his bag of stars so that 
the contents dribble out at a constant rate. Encumbered by 
his heavy bag of stellar light, the monkey sits at one point 
(xi,Vi) dribbling starlight unless disturbed. Every so often, 
however, we squeeze his tail and he jumps to a new position 
(xi+i, Di+i) with probability density a{xi+i,Vi+i). Let the 
tail squeezes be a Poisson process with rate A, so that the 
distribution of time intervals between consecutive squeezes 
is p{t) = Ae~^*. More generally, the distribution of times 
between the i^^ and (i -I- r)* squeezes follows a gamma dis- 
tribution. Then, if the monkey lands r times in a cell of 
volume 5V , the total length of time he spends in that cell 
has probability density 



p{t\r) = 



{\5{\t), r = 0; 

\Ae-^*(At)'-V(r-l)!, r>0. 



(B4) 



In the limit of many tail squeezes, the probability that he 
lands r times in a cell of volume SV \s again given by IIB2II . 
Summing over r, the probability distribution for the length 
of time he spends in the cell is 



(B5) 



= Ae-*" [5(Af) +e"^VMAt/i(2v'At^)] , 
with the Bessel function Ji coming in through the identity 
1 



E 



r!(r- + l)! ^ 



/l(2^/^). 



(B6) 



Equation lB5t is the ID prior with F = At. 



